In [13]:
import pandas as pd

Data Science on
Software Data

Markus Harrer, Software Development Analyst

@feststelltaste

Visual Software Analytics Summer School, 18 September 2019

About Me

In the past

  • Bachelor student
  • Researcher
  • Software developer
  • Master student*
  • Master's degree candidate*
  • Application developer

*and househusband

Now

My Motivation for Data Analysis in Software Development

The current problem in the industry

"Software Analytics" to the rescue?

Definition Software Analytics

"Software Analytics is analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions."

Tim Menzies and Thomas Zimmermann

Which kind of Software Data do we have?

  • static
  • runtime
  • chronological
  • Community

=> a great variety!

My problem with classic Software Analytics

Some analysis tasks from practice

  • Communicating negative performance implications of complex data models
  • Spotting concurrency problems in custom-built frameworks
  • Identifying performance bottlenecks across different software systems
  • Making knowledge that was lost through turnover visible
  • Analyzing the health of an open source community

"It depends" aka "context matters!"

Individual systems == individual problems => individual analyses => individual insights!

Others see that problem, too

Thomas Zimmermann in "One size does not fit all":

"The main lesson: There is no one size fits all model. Even if you find models that work for most, they will not work for everyone. There is much academic research into general models. In contrast, industrial practitioners are often fine with models that just work for their data if the model provides some insight or allows them to work more efficiently."

But: "... the methods typically are applicable on different datasets." => we see what's possible!



Data Science on Software Data:

A Lightweight Implementation of Software Analytics

Data Science

What is Data Science?

"Statistics on a Mac."

https://twitter.com/cdixon/status/428914681911070720

Data Science Venn Diagram (Drew Conway)

My Definition

What does "data" mean to me?

"Without data you're just another person with an opinion."

W. Edwards Deming

=> Delivering credible insights based on facts.

What does "science" mean to me?

"The aim of science is to seek the simplest explanations of complex facts."

Albert Einstein

=> Working out insights in a comprehensible way.

Why Data Science at all?

High demand in data analytics

Early-career positions are paid well...

Data from Stack Overflow Developer Survey 2019

... but also demanding?

Data from Stack Overflow Developer Survey 2019

"Who's Actively Looking for a Job?" (Top 5)

Big and supportive community

  • Free online courses, videos and tutorials (e.g. DataCamp with > 4.6M members)
  • Online communities whose members help each other (e.g. Stack Overflow)
  • Online competitions to improve your own skills (e.g. Kaggle)

Free and easy to use tools!

"R is for statisticians who want to program, Python is for developers who want to do statistics."

Data Science popularity is still growing!


In [ ]:

"100" == max. popularity!

How far away are Software Engineers from Data Science?

What is a Data Scientist?

"A data scientist is someone who
  is better at statistics
  than any software engineer
  and better at software engineering
  than any statistician."

From https://twitter.com/cdixon/status/428914681911070720

Not as far away as you may have thought!

How to Get Started?

Reuse a Proven Approach (~ scientific method)

Roger Peng's "Stages of Data Analysis"
I. Stating Question
II. Exploratory Data Analysis
III. Formal Modeling
IV. Interpretation
V. Communication

=> from a question via data to insights!

Be Aware of the "Seven principles of inductive software engineering" (Tim Menzies)

  1. Humans before algorithms
  2. Plan for Scale
  3. Get Early Feedback
  4. Be Open Minded
  5. Be Smart with Your Learning
  6. Live with the Data You Have
  7. Develop a Broad Skill Set That Uses a Big Toolkit

Use Literate Statistical Programming

(Intent + Code + Data + Results)
* Logical Step
+ Automation
= Literate Statistical Programming


Approach: Computational notebooks

Computational Notebook Example


Use Standard Data Science Tools

  • Jupyter Notebook
  • Python 3
  • pandas
  • matplotlib

Jupyter Notebook

Interactive Notebook

  • Document-based analyses
  • Executable Code
  • Displaying results immediately
  • Everything in one place
  • Every step to the solution visible

=> Working out results in a comprehensible way!

Python 3

Best programming language for Data Science!

  • Easy
  • Effective
  • Fast
  • Fun
  • Automation

=> Data Analysis becomes repeatable

pandas

Pragmatic data analysis framework

  • Tabular data structures ("programmable Excel sheet")
  • Really fast
  • Flexible
  • Expressive

=> Good integration point for your data sources!

matplotlib

Programmable visualization library

  • Programmatic creation of graphics
  • Plots line charts, bar charts, pie charts and much more
  • Integrated into pandas

=> Direct visualization of results in Jupyter Notebooks

The Python ecosystem


Data Analysis
  • NumPy
  • scikit-learn
  • TensorFlow
  • SciPy
  • PySpark
  • py2neo
Visualization and more
  • pygal
  • Bokeh
  • python-pptx
  • RISE
  • Requests, xmldataset, Selenium, Flask...

=> Provides the flexibility that is needed in specific situations

Other Technologies

Jupyter Notebook also works with other technology platforms, e.g.

  • jQAssistant software scanner / Neo4j graph database
  • JVM-based languages via beakerx / Tablesaw
  • bash

=> If you want to use special technology, you can!

Anaconda 3

Data Science Python Distribution

  • Free all-inclusive package
  • Brings everything you need to get started
  • Optimized for running fast on your operating system

=> Download, install, ready, go!

My Recommendations for an easy start

My TOP 5s*

https://www.feststelltaste.de/category/top5/

Courses, videos, blogs, books and more...

*some pages are still under development

My Book Recommendations

  • Adam Tornhill: Software Design X-Ray
  • Wes McKinney: Python For Data Analysis
  • Jeff Leek: The Elements of Data Analytic Style
  • Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering

Hands-On

Programming Demo

Case Study

IntelliJ IDEA

  • IDE for Java developers
  • Almost entirely written in Java
  • Big and long-living project

I. Stating Question (1/3)

  • Write down your question explicitly
  • Explain analysis idea comprehensibly

I. Stating Question (2/3)

Question

  • Which code is complex and has changed often lately?

I. Stating Question (3/3)

Implementation Idea

  • Tools: Jupyter, Python, pandas, matplotlib
  • Heuristics:
    • "complex": many lines of code
    • "change often": number of Git commits
    • "lately": last 30 days

Meta goal: Get to know the basic mechanics of the stack.

II. Exploratory Data Analysis

  • Load and explore possible data sets
  • Clean up and filter the raw data

We load a Git log dataset extracted from a Git repository.


In [ ]:
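The cell content is not preserved in this export. A minimal sketch of the loading step, assuming the Git log was exported to CSV with six columns; the column names `sha`, `timestamp`, `author`, `filename`, `additions` and `deletions` and the inline sample rows are hypothetical stand-ins for the real dataset:

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported Git log. The real file and its
# column names are not shown in the slides.
SAMPLE_CSV = io.StringIO(
    "sha,timestamp,author,filename,additions,deletions\n"
    "a1,2019-09-10 08:15:00,Alice,src/Main.java,10,2\n"
    "a1,2019-09-10 08:15:00,Alice,README.md,1,0\n"
    "b2,2019-07-01 17:40:00,Bob,src/Util.java,5,5\n"
)

# read_csv accepts a file path as well as a file-like object,
# so a real export can be loaded by passing its path instead.
log = pd.read_csv(SAMPLE_CSV)
print(log.head())
```

Each row describes one file touched by one commit, which is why the commit `a1` appears twice.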

We explore some basic key elements of the dataset


In [ ]:
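One possible way to get a first overview, sketched on a tiny hypothetical sample (the column names are assumptions, and the real dataset is far larger):

```python
import pandas as pd

# Tiny hypothetical sample with assumed column names.
log = pd.DataFrame({
    "sha": ["a1", "a1", "b2"],
    "timestamp": ["2019-09-10 08:15:00", "2019-09-10 08:15:00", "2019-07-01 17:40:00"],
    "author": ["Alice", "Alice", "Bob"],
    "filename": ["src/Main.java", "README.md", "src/Util.java"],
    "additions": [10, 1, 5],
    "deletions": [2, 0, 5],
})

# info() prints column names, dtypes and non-null counts.
log.info()

# shape gives (number of rows, number of columns).
rows, columns = log.shape
print(f"{rows} rows, {columns} columns")

# value_counts() shows how often each value occurs in a Series.
print(log["author"].value_counts())
```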

1 DataFrame (~ programmable Excel worksheet), 6 Series (= columns), 1128819 rows (= entries)

We convert the text with a time to a real timestamp object.


In [ ]:
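A sketch of the conversion with `pd.to_datetime`, assuming the time is stored as text in a column named `timestamp` (a hypothetical name):

```python
import pandas as pd

# Hypothetical sample: timestamps are still plain strings after loading.
log = pd.DataFrame({
    "filename": ["src/Main.java", "README.md"],
    "timestamp": ["2019-09-10 08:15:00", "2019-07-01 17:40:00"],
})

# Parse the strings into real timestamp objects.
log["timestamp"] = pd.to_datetime(log["timestamp"])
print(log.dtypes)
```

With real timestamps, comparisons and date arithmetic work directly on the column.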

We filter out older changes.


In [ ]:
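A sketch of the "last 30 days" filter. It assumes the window is measured back from the newest change in the dataset rather than from today, so the result stays reproducible; the sample data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical sample with already converted timestamps.
log = pd.DataFrame({
    "filename": ["src/Main.java", "README.md", "src/Util.java"],
    "timestamp": pd.to_datetime(["2019-09-10", "2019-09-01", "2019-07-01"]),
})

# Assumption: "lately" is counted back from the newest change in the data.
newest = log["timestamp"].max()
cutoff = newest - pd.Timedelta(days=30)

# A boolean mask keeps only the rows that are newer than the cutoff.
recent = log[log["timestamp"] > cutoff]
print(recent)
```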

We keep just code written in Java.


In [ ]:
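A sketch of the Java filter, assuming that Java code can be recognized by the `.java` file extension in a hypothetical `filename` column:

```python
import pandas as pd

# Hypothetical sample of recently changed files.
recent = pd.DataFrame({
    "filename": ["src/Main.java", "README.md", "build.gradle", "src/Util.java"],
})

# The .str accessor applies string methods to every entry of a Series.
java = recent[recent["filename"].str.endswith(".java")]
print(java)
```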

III. Formal Modeling

  • Create new perspective on the data
  • Join data with other datasets

We aggregate the rows by counting the number of changes per file.


In [ ]:
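A sketch of the aggregation, assuming one row per file per commit so that counting rows per file equals the number of changes; the column names `sha` and `filename` are hypothetical:

```python
import pandas as pd

# Hypothetical sample: one row per file touched by a commit.
java_changes = pd.DataFrame({
    "sha": ["a1", "b2", "c3"],
    "filename": ["src/Main.java", "src/Main.java", "src/Util.java"],
})

# Group by file and count the commits that touched each file.
changes = (
    java_changes.groupby("filename")[["sha"]]
    .count()
    .rename(columns={"sha": "changes"})
)
print(changes)
```

The result is indexed by file name, which makes it easy to combine with other per-file data.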

We add additional information about the number of lines of all currently existing files...


In [ ]:
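The slides do not show how the line counts were obtained. One possible sketch counts the lines of every `.java` file below a directory; the helper `count_lines` is hypothetical, and a temporary directory stands in for the checked-out repository:

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_lines(root):
    """Return a DataFrame with the line count of every .java file below root."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*.java")):
        with open(path, encoding="utf-8", errors="ignore") as source:
            lines = sum(1 for _ in source)
        # Store paths relative to the repository root with forward slashes,
        # the same form Git uses for file names.
        rows.append({"filename": path.relative_to(root).as_posix(), "lines": lines})
    return pd.DataFrame(rows, columns=["filename", "lines"]).set_index("filename")


# Temporary directory as a stand-in for a real working copy.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "src").mkdir()
    (Path(repo) / "src" / "Main.java").write_text("class Main {\n}\n")
    (Path(repo) / "src" / "Util.java").write_text("class Util {\n  int x;\n}\n")
    loc = count_lines(repo)

print(loc)
```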

...and join this data with the existing dataset.


In [ ]:
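A sketch of the join, assuming both datasets are indexed by file name. The inner join is an assumption: it drops files that were changed recently but no longer exist, and files that exist but were not changed. All values are made up:

```python
import pandas as pd

# Hypothetical number of recent changes per file.
changes = pd.DataFrame(
    {"changes": [2, 1]},
    index=pd.Index(["src/Main.java", "src/Old.java"], name="filename"),
)

# Hypothetical line counts of the files that currently exist.
loc = pd.DataFrame(
    {"lines": [120, 40]},
    index=pd.Index(["src/Main.java", "src/Util.java"], name="filename"),
)

# join() matches rows on the index; "inner" keeps only files present in both.
hotspots = changes.join(loc, how="inner")
print(hotspots)
```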

IV. Interpretation

  • Work out the essence of the analysis
  • Make the central message / new insight clear

We show only the TOP 10 hotspots in the code.


In [ ]:
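A sketch of the TOP 10 selection. The ranking criterion is an assumption (most changes first, more lines as tie-breaker); the slides do not state how the hotspots were ordered, and the data is synthetic:

```python
import pandas as pd

# Synthetic data: twelve files with made-up change and line counts.
hotspots = pd.DataFrame(
    {
        "changes": range(1, 13),
        "lines": [100 * n for n in range(12, 0, -1)],
    },
    index=pd.Index([f"File{n}.java" for n in range(1, 13)], name="filename"),
)

# Assumption: rank by number of changes, then by lines of code.
top10 = hotspots.sort_values(by=["changes", "lines"], ascending=False).head(10)
print(top10)
```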

V. Communication

  • Transform insights into a comprehensible visualization
  • Communicate the next steps after the analysis

We plot the TOP 10 list as an XY diagram.


In [ ]:
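A sketch of the XY diagram using the plotting integration of pandas, with made-up values and hypothetical column names. The `Agg` backend is selected only so that the sketch runs without a display; in a Jupyter Notebook that line can be dropped and the plot appears inline:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, not needed inside a notebook

import pandas as pd

# Made-up hotspot candidates.
top10 = pd.DataFrame(
    {"changes": [12, 9, 7], "lines": [300, 1500, 800]},
    index=pd.Index(["A.java", "B.java", "C.java"], name="filename"),
)

# Scatter plot: number of changes on the x axis, lines of code on the y axis.
ax = top10.plot.scatter(x="changes", y="lines", title="Hotspot candidates")

# Label every point with its file name.
for name, row in top10.iterrows():
    ax.annotate(name, (row["changes"], row["lines"]))
```

Files in the upper right corner are both large and frequently changed, which makes them the first candidates for a closer look.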

End of Demo

Further Analysis

  • Analysis of performance bottlenecks with data from vmstat
  • Identifying Modularization Options based on Code Changes
  • Dependency Analysis with data from jdeps and visualization with D3

Summary

1. Software Analytics with Data Science is possible!
2. If you need to go into deeper analysis: you can!
3. There are many data sources in software development. What are you waiting for?

=> from a question via data to insights!

Thanks! Questions?

Markus Harrer
innoQ Deutschland GmbH

markus.harrer@innoq.com

@feststelltaste